Install and load necessary packages
library(palmerpenguins)
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly) #for interactive plots
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Question 1 (Q1a) Creating a figure using the Palmer Penguin data set that is correct but badly communicates the data
#plotting a scatter graph for bill length (x axis) against flipper length (y axis). High point size of 30 and reversed x axis
ggplot(data=penguins,
aes(x=bill_length_mm,
y=flipper_length_mm), colour=species)+
geom_point(size=30)+ #increasing the size of points
scale_x_reverse() + #reversing the x axis
labs(x="Bill Length",
y="Flipper Length")
## Warning: Removed 2 rows containing missing values (`geom_point()`).
(Q1b) Write about how your design choices mislead the reader about the underlying data (200-300 words)
Question 2 (Q1a) Write a data analysis pipeline in your .rmd RMarkdown file. You should be aiming to write a clear explanation of the steps as well as clear code
Introduction
The data set contains size measurements for three species of penguins found on three islands in the Palmer Archipelago, Antarctica (Gorman et al. 2014). This data pipeline explores the Palmer Penguins data set further, focusing on interspecific variation in flipper length (i.e. whether flipper length differs between species).
The data must first be cleaned appropriately.
# View the structure of the dataset
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
# Check for missing values in the original dataset
summary(is.na(penguins))
## species island bill_length_mm bill_depth_mm
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:344 FALSE:344 FALSE:342 FALSE:342
## TRUE :2 TRUE :2
## flipper_length_mm body_mass_g sex year
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:342 FALSE:342 FALSE:333 FALSE:344
## TRUE :2 TRUE :2 TRUE :11
# Remove rows with NA in 'species' or 'flipper_length_mm', and keep only positive flipper lengths (a data integrity check)
cleaned_penguins <- penguins[complete.cases(penguins$species, penguins$flipper_length_mm) & penguins$flipper_length_mm > 0, ]
# Check the cleaned dataset
summary(is.na(cleaned_penguins))
## species island bill_length_mm bill_depth_mm
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:342 FALSE:342 FALSE:342 FALSE:342
##
## flipper_length_mm body_mass_g sex year
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:342 FALSE:342 FALSE:333 FALSE:342
## TRUE :9
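The same cleaning step could also be written with dplyr, which is already loaded above. This is a minimal sketch rather than the pipeline's own code; the name `cleaned_penguins_dplyr` is hypothetical, chosen so that it does not overwrite `cleaned_penguins`.

```r
library(palmerpenguins)
library(dplyr)

# filter() drops rows where a condition is FALSE or NA, so the explicit
# is.na() checks are included for readability rather than necessity
cleaned_penguins_dplyr <- filter(
  penguins,
  !is.na(species),
  !is.na(flipper_length_mm),
  flipper_length_mm > 0
)

# Number of rows removed by cleaning (the two rows with a missing flipper length)
rows_removed <- nrow(penguins) - nrow(cleaned_penguins_dplyr)
rows_removed
```

Comparing row counts before and after cleaning is a quick way to confirm that only the intended rows were dropped.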
Secondly, the data must be explored. This can be achieved by creating an exploratory figure.
# Jitter plot (exploratory figure) for flipper length of each species. Alpha = 0.7 ensures points have some transparency to avoid over-plotting. Species identified by colour.
ggplot(data = cleaned_penguins, aes(x = species, y = flipper_length_mm)) +
geom_jitter(aes(color = species),
width = 0.2,
alpha = 0.7) +
labs(title = "Jitter Plot Showing the Distribution of Flipper Lengths by Species",
x="Species",
y="Flipper Length (mm)")+
scale_color_manual(values = c("darkorange","darkgreen","blue3"))
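Alongside the exploratory figure, a numerical summary per species can help when judging how far apart the groups are. The sketch below is an illustrative addition, not part of the original pipeline; the object names (`flipper_data`, `flipper_summary`) are hypothetical, and the block rebuilds its own data so that it runs independently.

```r
library(palmerpenguins)

# Keep rows with a recorded flipper length (assumed to match the cleaned data)
flipper_data <- penguins[!is.na(penguins$flipper_length_mm), ]

# Sample size, mean and standard deviation of flipper length per species
species_n    <- tapply(flipper_data$flipper_length_mm, flipper_data$species, length)
species_mean <- tapply(flipper_data$flipper_length_mm, flipper_data$species, mean)
species_sd   <- tapply(flipper_data$flipper_length_mm, flipper_data$species, sd)

flipper_summary <- data.frame(
  species = names(species_n),
  n       = as.vector(species_n),
  mean_mm = round(as.vector(species_mean), 1),
  sd_mm   = round(as.vector(species_sd), 1)
)
flipper_summary
```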
Hypothesis:
H0: There is no difference in mean flipper length between species
H1: The mean flipper length of at least one species differs from that of at least one other species
Based on the exploratory figure, I predict that the flipper length of Gentoo penguins will significantly differ from both Chinstrap and Adelie. Therefore, I predict that there is a difference in mean flipper length between species.
Statistical Methods:
As the investigation relates a categorical explanatory variable (species) to a continuous response variable (flipper length), a one-way Analysis of Variance (ANOVA) will be used. ANOVA tests whether individuals drawn from different groups are, on average, more different from each other than individuals drawn from the same group. Here, it tests whether penguins of different species differ more in flipper length, on average, than penguins of the same species. ANOVA has four assumptions: normality of residuals, homogeneity of variances, independence, and random sampling. These are assumed to be met for this analysis.
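Rather than taking the assumptions on trust, normality and homogeneity of variances could be checked. The sketch below is one possible approach and not part of the original pipeline: it assumes that a Shapiro-Wilk test on the model residuals and Bartlett's test across species are acceptable checks, and it fits the model itself (under the hypothetical name `check_mod`) so that it runs on its own. Independence and random sampling depend on the study design and cannot be tested this way.

```r
library(palmerpenguins)

# Rows with a recorded flipper length (assumed to match the cleaned data)
check_data <- penguins[!is.na(penguins$flipper_length_mm), ]

# One-way model of flipper length by species
check_mod <- lm(flipper_length_mm ~ species, data = check_data)

# Normality of residuals: a small p-value suggests departure from normality
normality_check <- shapiro.test(residuals(check_mod))

# Homogeneity of variances: a small p-value suggests unequal group variances
variance_check <- bartlett.test(flipper_length_mm ~ species, data = check_data)

normality_check$p.value
variance_check$p.value
```

With a sample this large, formal tests can flag trivial departures, so diagnostic plots such as `plot(check_mod, which = 1:2)` are often read alongside them.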
#Fitting a linear model to the cleaned data
flipper_mod1 <- lm(flipper_length_mm ~ species, cleaned_penguins)
summary(flipper_mod1) #showing summary of linear model
##
## Call:
## lm(formula = flipper_length_mm ~ species, data = cleaned_penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.9536 0.5405 351.454 < 2e-16 ***
## speciesChinstrap 5.8699 0.9699 6.052 3.79e-09 ***
## speciesGentoo 27.2333 0.8067 33.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 339 degrees of freedom
## Multiple R-squared: 0.7782, Adjusted R-squared: 0.7769
## F-statistic: 594.8 on 2 and 339 DF, p-value: < 2.2e-16
#Running the ANOVA function
anova(flipper_mod1)
## Analysis of Variance Table
##
## Response: flipper_length_mm
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 52473 26236.6 594.8 < 2.2e-16 ***
## Residuals 339 14953 44.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
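The F value and the R-squared reported above can both be recovered from the sums of squares in the ANOVA table. The short worked check below uses only the printed values; the variable names are hypothetical.

```r
# Values copied from the ANOVA table above
ss_species  <- 52473  # Sum Sq for species
ss_residual <- 14953  # Sum Sq for residuals
df_species  <- 2
df_residual <- 339

# Mean squares are sums of squares divided by their degrees of freedom
ms_species  <- ss_species / df_species
ms_residual <- ss_residual / df_residual

# F is the ratio of between-group to within-group mean squares
f_value <- ms_species / ms_residual

# R-squared is the share of total variation attributed to species
r_squared <- ss_species / (ss_species + ss_residual)

round(f_value, 1)    # approximately 594.8
round(r_squared, 3)  # approximately 0.778
```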
Based on this ANOVA, a post-hoc comparison can be conducted to identify which species differ from each other. The post-hoc comparison used here is the Tukey-Kramer test, also known as Tukey's Honest Significant Difference test (Tukey HSD for short).
#Run the TukeyHSD
TukeyHSD(aov(flipper_mod1))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = flipper_mod1)
##
## $species
## diff lwr upr p adj
## Chinstrap-Adelie 5.869887 3.586583 8.153191 0
## Gentoo-Adelie 27.233349 25.334376 29.132323 0
## Gentoo-Chinstrap 21.363462 19.000841 23.726084 0
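The `p adj` column is reported as 0, meaning the adjusted p-values are smaller than the precision shown. Another way to read the result is from the confidence intervals: a pair differs at the 5% family-wise level when its interval excludes zero. A minimal sketch, refitting the same one-way model with `aov()` so that the block runs on its own (the names `tukey_table` and `excludes_zero` are hypothetical):

```r
library(palmerpenguins)

# Rows with a recorded flipper length (assumed to match the cleaned data)
tukey_data <- penguins[!is.na(penguins$flipper_length_mm), ]

# Tukey HSD on the one-way ANOVA of flipper length by species
tukey_fit <- TukeyHSD(aov(flipper_length_mm ~ species, data = tukey_data))

# Convert the comparison matrix to a data frame and flag intervals excluding zero
tukey_table <- as.data.frame(tukey_fit$species)
tukey_table$excludes_zero <- tukey_table$lwr > 0 | tukey_table$upr < 0
tukey_table
```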
Results:
Box plots can be used to visualise the results of the ANOVA. This is because the box plot shows the distribution of each group (species) and any potential differences in central tendency and spread.
#Box plot for flipper length and species. Customizing box plot appearance by changing colour, point size, point transparency, and jitter
ggplot(data = cleaned_penguins, aes(x = species, y = flipper_length_mm, colour = species)) +
  geom_boxplot(fill = "white", color = "black") + # White boxes with black outlines
  geom_point(position = position_jitter(width = 0.2, height = 0), size = 2, alpha = 0.7) + # Constant transparency set outside aes(); horizontal jitter only so plotted lengths stay exact
  scale_color_manual(values = c("darkorange", "darkgreen", "blue3")) + # Set custom colors
  labs(title = "Distribution of Flipper Lengths by Species",
       x = "Penguin Species", y = "Flipper Length (mm)") +
  theme_minimal()
#Making an Interactive Box plot
boxplot <- ggplot(data = cleaned_penguins, aes(x = species, y = flipper_length_mm, color = species)) +
  geom_boxplot(fill = "white", color = "black") +
  geom_point(position = position_jitter(width = 0.2, height = 0), size = 2, alpha = 0.8) + # Constant transparency set outside aes()
  scale_color_manual(values = c("darkorange", "darkgreen", "blue3")) +
  labs(title = "Distribution of Flipper Lengths by Species",
       x = "Penguin Species", y = "Flipper Length (mm)") +
  theme_minimal()
# Convert the ggplot object to an interactive plotly object
interactive_boxplot <- ggplotly(boxplot)
# Print the interactive box plot
interactive_boxplot
#Viewers can interact with the plot by hovering over elements to highlight and identify individual points
An alternative visualisation method is the violin plot. Violin plots can also be used to visualise the results of the ANOVA as they show the distribution of the data along with the summary statistics and are useful for comparing different groups, which in this case is species.
#Violin plot for flipper length and species. Customizing violin plot appearance by changing colour, point size, point transparency, and jitter
ggplot(data = cleaned_penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_violin(color = "black", scale = "width", trim = FALSE, alpha = 0.7) + # Violins filled by species; constant transparency set outside aes()
  geom_point(position = position_jitter(width = 0.2, height = 0), size = 2, alpha = 0.7) + # Horizontal jitter only so plotted lengths stay exact
  scale_fill_manual(values = c("darkorange", "darkgreen", "blue3")) + # Set custom colors
  labs(title = "Distribution of Flipper Lengths by Species",
       x = "Penguin Species", y = "Flipper Length (mm)") +
  theme_minimal()
#Making an Interactive Violin plot
violin <- ggplot(data = cleaned_penguins, aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_violin(color = "black", scale = "width", trim = FALSE, alpha = 0.8) + # Violins filled by species; constant transparency set outside aes()
  geom_point(position = position_jitter(width = 0.2, height = 0), size = 2, alpha = 0.8) + # Horizontal jitter only so plotted lengths stay exact
  scale_fill_manual(values = c("darkorange", "darkgreen", "blue3")) + # Set custom colors
  labs(title = "Distribution of Flipper Lengths by Species",
       x = "Penguin Species", y = "Flipper Length (mm)") +
  theme_minimal()
# Convert the ggplot object to an interactive plotly object
interactive_violin <- ggplotly(violin)
# Print the interactive plot
interactive_violin
#Viewers can interact with the plot by hovering over elements to highlight and identify individual points
Discussion:
The coefficient table shows that the mean flipper length is 189.954 mm for Adelie penguins, 195.824 mm for Chinstrap penguins, and 217.187 mm for Gentoo penguins. Ecologically, Gentoo penguins are deep divers whereas Adelie and Chinstrap penguins are shallow divers. This behavioural segregation appears to be reflected anatomically, with Gentoo penguins having longer flippers to support deeper swimming (Trivelpiece et al. 1987). The adjusted R-squared value of 0.777 indicates that about 78% of the variation in flipper length is explained by differences between species. This is high for biological data.
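The species means come from the coefficient table: with treatment contrasts (R's default for unordered factors), the intercept is the mean of the reference level (Adelie) and each other coefficient is a difference from it. A short worked check using the printed estimates, with hypothetical variable names:

```r
# Estimates copied from the coefficient table above
intercept      <- 189.9536  # mean for Adelie (the reference level)
coef_chinstrap <- 5.8699    # Chinstrap mean minus Adelie mean
coef_gentoo    <- 27.2333   # Gentoo mean minus Adelie mean

# Each species mean is the intercept plus that species' coefficient
species_means <- c(
  Adelie    = intercept,
  Chinstrap = intercept + coef_chinstrap,
  Gentoo    = intercept + coef_gentoo
)
round(species_means, 1)  # about 190.0, 195.8 and 217.2 mm
```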
The ANOVA table shows a p-value of less than 2.2e-16, far below the conventional alpha level of 0.05, so there is strong evidence to reject the null hypothesis. This means that the mean flipper length of at least one species differs from that of at least one other species. However, as these penguin species are phylogenetically related, independence of data points (one of the key assumptions of ANOVA) cannot be guaranteed (Tarroux et al. 2018).
To identify which species differ from each other, a Tukey HSD test was conducted. For every pairwise comparison the adjusted p-value is less than 0.05, so the null hypothesis for each pair (i.e. that the two group means are the same) is rejected. Therefore, all three species differ significantly from each other in flipper length.
These results make biological sense, as different species have different anatomical and physiological adaptations. Each species of penguin may therefore be expected to have a different flipper length, reflecting adaptation to a different ecological niche.
Conclusion:
Based on the results and analysis, we can conclude that flipper length varies significantly between species. This interspecific variation is biologically coherent, as ecological differences between species lead to anatomical adaptations. In order of shortest to longest flipper length, the species are: Adelie, Chinstrap, and Gentoo.
References:
Gorman, K. B., Williams, T. D., & Fraser, W. R. (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE, 9(3), e90081. https://doi.org/10.1371/journal.pone.0090081
Tarroux, A., Lydersen, C., Trathan, P. N., & Kovacs, K. M. (2018). Temporal variation in trophic relationships among three congeneric penguin species breeding in sympatry. Ecology and Evolution, 8(7), 3660–3674. https://doi.org/10.1002/ece3.3937
Trivelpiece, W. Z., Trivelpiece, S. G., & Volkman, N. J. (1987). Ecological Segregation of Adelie, Gentoo, and Chinstrap Penguins at King George Island, Antarctica. Ecology, 68(2), 351–361. https://doi.org/10.2307/1939266